Protein Family Databases for Automated Protein Domain Identification
Abstract
Automatic identification and annotation of protein domains is a major challenge for genome sequencing projects. Simple transfer of the annotation from the overall most similar protein with a known function is relatively reliable for prokaryotic proteins, but often produces misleading and incomplete results for multi-domain proteins, which are common in higher organisms. An alternative approach is to classify protein domains based on matches to a precompiled database of protein domain families. A number of such databases are reviewed here, including an update on the Pfam database. The differences a user can expect to experience when using different databases for domain identification are illustrated by examples of known multi-domain proteins. The advantages and drawbacks of single-sequence versus multiple-alignment methods are also discussed. The degree of protein modularity was surveyed in the genomes of Caenorhabditis elegans, Saccharomyces cerevisiae, and Haemophilus influenzae by matching them to Pfam. While prokaryotic genomes typically have only a small fraction of multi-domain proteins, which rarely contain more than three domains, at least 10% of higher eukaryotic proteins have multiple domains, often with dozens of domains per protein chain.
Similar resources
The Protein Information Resource
The Protein Information Resource (PIR) is an integrated public resource of protein informatics that supports genomic and proteomic research and scientific discovery. PIR maintains the Protein Sequence Database (PSD), an annotated protein database containing over 283 000 sequences covering the entire taxonomic range. Family classification is used for sensitive identification, consistent annotati...
ADDA: a domain database with global coverage of the protein universe
We used the Automatic Domain Decomposition Algorithm (ADDA) to generate a database of protein domain families with complete coverage of all protein sequences. Sequences are split into domains and domains are grouped into protein domain families in a completely automated process. The current database contains domains for more than 1.5 million sequences in more than 40,000 domain families. In par...
Designing a new tetrapeptide to inhibit the BIR3 domain of the XIAP protein via molecular dynamics simulations
The XIAP protein is a member of the inhibitor of apoptosis protein (IAP) family. It plays a central role in the inhibition of apoptosis and consists of three Baculoviral IAP Repeat (BIR) domains. The BIR3 domain binds directly to the N-terminus of caspase-9 and thereby inhibits apoptosis. The N-terminal tetrapeptide region of the SMAC protein can bind to BIR3, inhibit it, and subsequently induce apoptosis. In th...
ProClass protein family database
ProClass is a protein family database that organizes non-redundant sequence entries into families defined collectively by PROSITE patterns and PIR superfamilies. By combining global similarities and functional motifs into a single classification scheme, ProClass helps to reveal domain and family relationships and classify multi-domain proteins. The database currently consists of more than 120 0...
Family Classification and Integrative Analysis for Protein Functional Annotation
High-throughput genome projects have resulted in a rapid accumulation of predicted protein sequences; however, experimentally verified information on protein function lags far behind. The common approach of inferring the function of uncharacterized proteins from sequence similarity to annotated proteins in sequence databases often results in over-identification, under-identification, or even...